Abstract

A modification of the kernel estimator for density estimation is proposed which
allows the incorporation of local information about the smoothness of the density.
The estimator uses a small set of local bandwidths rather than a single global one
as in the standard kernel estimator. It uses a set of filtering functions which
determine the extent of influence of the local bandwidths. Various versions of the
idea are discussed. The estimator is shown to be consistent and is illustrated by
comparison to the single bandwidth kernel estimator for the case in which the filter func-


1.  INTRODUCTION 

The kernel density estimator has been studied widely since its introduction in
Rosenblatt 1956 and Parzen 1962. Given i.i.d. data $x_1, \ldots, x_n$ drawn from the
unknown density $\alpha$, the standard kernel estimator is the single bandwidth
estimator:

$$\hat{\alpha}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{1}$$
Much work has been done on selecting the optimal bandwidth h under different
assumptions on $\alpha$ or different optimality criteria (see the recent books by
Silverman 1986 and Scott 1992, and the bibliographies contained therein, for a good
introduction to kernel estimators and bandwidth selection). Alternatively, variable
bandwidth kernel estimators are of the form

$$\hat{\alpha}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h_i} K\left(\frac{x - x_i}{h_i}\right) \tag{2}$$

or variations on this theme. One then requires a choice of many bandwidths, and
several approaches have been investigated. The obvious problem which may arise
in these variable bandwidth estimators is that it is not always clear how to best
incorporate a priori information about the local smoothness of the density into
these estimators. Furthermore, these estimators usually break down in the tails
where the data is sparse, and hence it is difficult to get good estimates of
appropriate local bandwidths.

We propose a modification to the standard kernel estimator (1), first introduced
in Rogers, Priebe, and Solka 1993, which uses a small number of bandwidths
instead of either extreme exemplified by equations (1) and (2).

Suppose we wish to have a small number of bandwidths where each bandwidth
is associated with a region of the support of the density. To each bandwidth we
associate a function which “filters” the data, in a sense to be described. Basically,
the filter will define the extent to which each local bandwidth is to be used for any
particular data point. We can then construct a kernel estimator which is a
combination of the kernel estimators constructed using each bandwidth, with the data
filtered by the filtering functions. To be specific, consider a set of functions
$\{\rho_j\}_{j=1}^{m}$ where $0 \le \rho_j(x) \le 1$ for all x, and

$$\sum_{j=1}^{m} \rho_j(x) = 1 \tag{3}$$

for all x, as will be seen below. The $\rho$ functions can be interpreted as posterior
probabilities and are used to incorporate prior information concerning local
smoothness. We will refer to the $\rho$ functions as filtering functions. Associate to
each filtering function $\rho_j$ a bandwidth $h_j$ such that

$$0 < h_j, \qquad h_j \to 0, \qquad n h_j \to \infty \tag{4}$$

as $n \to \infty$. The filtered kernel estimator (FKE) for the filter
$\{\rho_j\}_{j=1}^{m}$ is

$$\hat{\alpha}(x) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\rho_j(x_i)}{h_j} K\left(\frac{x - x_i}{h_j}\right) \tag{5}$$
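The double sum in (5) translates directly into code. The sketch below assumes a normal kernel and uses two hypothetical logistic filters that split the real line near zero; both choices are illustrative and are not taken from the paper.

```python
import math

def normal_kernel(t):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def fke(x, data, filters, bandwidths, kernel=normal_kernel):
    """Filtered kernel estimate (5) at x.

    filters    -- callables rho_j, assumed to sum to 1 at every point
    bandwidths -- one bandwidth h_j per filter
    """
    n = len(data)
    total = 0.0
    for xi in data:
        for rho, h in zip(filters, bandwidths):
            # the filter is evaluated at the data point xi, not at x
            total += rho(xi) / h * kernel((x - xi) / h)
    return total / n

# Hypothetical filters: rho_left dominates for x < 0, rho_right for x > 0.
def rho_left(x):
    return 1.0 / (1.0 + math.exp(4.0 * x))

def rho_right(x):
    return 1.0 - rho_left(x)

sample = [-1.2, -0.4, 0.0, 0.3, 1.1]          # illustrative data
value = fke(0.2, sample, [rho_left, rho_right], [0.3, 1.5])
```

Because the filters sum to one at each data point and each scaled kernel integrates to one, the estimate integrates to one for any choice of bandwidths; with all bandwidths equal it reduces to the single bandwidth estimator.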
The filtered kernel estimator comes from the following: given a finite mixture

$$f(x) = \sum_{j=1}^{m} \pi_j f_j(x) \tag{6}$$

and data $\{x_i\}$ with unknown density $\alpha(x)$, the kernel estimator filtered by
the mixture f is defined to be (5) where

$$\rho_j(x) = \frac{\pi_j f_j(x)}{f(x)}. \tag{7}$$
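For a mixture of normals, the filtering functions in (7) are the usual posterior component probabilities. A minimal sketch, assuming normal components and using hypothetical helper names:

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_filters(weights, means, sigmas):
    """Return the posterior functions rho_j of (7) for a normal mixture."""
    def make_rho(j):
        def rho(x):
            parts = [w * normal_pdf(x, m, s)
                     for w, m, s in zip(weights, means, sigmas)]
            # Note: sum(parts) underflows to 0 very far in the tails;
            # a production version would work with logarithms instead.
            return parts[j] / sum(parts)
        return rho
    return [make_rho(j) for j in range(len(weights))]

# Illustrative two component mixture with equal means and unequal variances
filters = mixture_filters([0.5, 0.5], [0.0, 0.0], [1.0, 3.0])
```

At x = 0 the narrow component has three times the density of the wide one, so its posterior weight there is 0.75; in the tails the wide component takes over.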


The idea is to use a value of h for each component of f which is in some sense
optimal for that component under the overall mixture model (6) and thus vary the
bandwidth according to the individual variances of the filtering mixture. It is
appealing to make use of the posterior probability of component membership as
the local contribution for a given bandwidth. In practice, one would fit a mixture to
the data which one felt was a good representative of the local variance of the
underlying distribution, then use the mixture to construct bandwidths and a filtered
kernel estimator. This approach works well even when the data is not distributed as
a given finite mixture, provided that the mixture captures enough of the local
variance characteristics of the data. Unfortunately, as will be seen, the calculation of
the bandwidths $h_j$ is not as simple as that for the standard kernel estimator (SKE),
and requires the minimization of a function whose solutions are not known in
closed form.

The $\rho_j(x_i)$ term in equation (5) weights the contribution of the kernel centered
at $x_i$ by its posterior component membership. This is shown in Figure 1, where a
two component mixture density, an illustrative selection of kernels weighted by the
$\rho_j$ functions, and the $\rho_j$ functions themselves are shown.

An alternative to this formulation can be obtained by considering a mixture of
kernel estimators in which the posterior probabilities $\rho_j(x)$ at the point being
estimated play the role of the mixing coefficients. This second approach allows the
incorporation of information about the support of the density. This estimator is

$$\tilde{\alpha}(x) = \sum_{j=1}^{m} \rho_j(x) \left( \frac{1}{n h_j} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h_j}\right) \right) \tag{8}$$

which can be rewritten in the form of the filtered kernel estimator as

$$\tilde{\alpha}(x) = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\rho_j(x)}{h_j} K\left(\frac{x - x_i}{h_j}\right) \tag{9}$$

We incorporate information about the support of $\alpha$ by the condition that
$\rho_j(x) = 0$ where $\alpha$ is known to vanish. We must have

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{1}{h_j} \int \rho_j(x) K\left(\frac{x - x_i}{h_j}\right) dx = 1 \tag{10}$$

in order to guarantee that the estimate is a density. Since the proportions are not
fixed, but are functions of x, we have a potentially different mixture for each x,
which allows the incorporation of local scaling of the estimator.
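The second estimator, and the normalization requirement that accompanies it, can be illustrated with a short sketch. The kernel, the logistic filters, the sample and the integration range are all assumptions chosen for illustration; `total_mass` simply approximates the left side of (10) by the trapezoid rule.

```python
import math

def normal_kernel(t):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def fke_at_point(x, data, filters, bandwidths, kernel=normal_kernel):
    """Estimator (9): each rho_j is evaluated at x, not at the data points."""
    n = len(data)
    total = 0.0
    for rho, h in zip(filters, bandwidths):
        ske_j = sum(kernel((x - xi) / h) for xi in data) / (n * h)
        total += rho(x) * ske_j   # mixture of single bandwidth estimates
    return total

def total_mass(estimator, lo, hi, steps=4000):
    """Trapezoid approximation to the integral of estimator over [lo, hi]."""
    dx = (hi - lo) / steps
    inner = sum(estimator(lo + i * dx) for i in range(1, steps))
    return dx * (0.5 * estimator(lo) + inner + 0.5 * estimator(hi))

# Hypothetical filters and data, for illustration only
def rho_left(x):
    return 1.0 / (1.0 + math.exp(4.0 * x))

def rho_right(x):
    return 1.0 - rho_left(x)

sample = [-1.2, -0.4, 0.0, 0.3, 1.1]
mass = total_mass(
    lambda x: fke_at_point(x, sample, [rho_left, rho_right], [0.3, 1.5]),
    -15.0, 15.0)
```

When the filters are constant in x the mass is one automatically; for filters that vary with x, `mass` need not equal one, which is why condition (10) has to be imposed separately (one simple option, not discussed in the paper, would be to divide the estimate by `mass`).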

We draw explicit attention to the dichotomous views demonstrated in (5) and
(9). The estimator in (5) attributes the posterior component membership to the data
point at which we center the kernel, while (9) focuses on the component membership
at the point at which we are computing the functional estimate. This type of
dichotomy is inherent in the standard kernel estimator but in this case leads to
distinct estimators.

We will assume throughout the formulation (5) for the filtered kernel estimator,
unless otherwise noted. However, (9) might be of interest for those situations
where the density is known to vanish, for instance for densities of known support.
A combination of the two estimators (5) and (9) may in some situations be desired,
but will not be pursued here.

Although we are concerned here with univariate densities, the filtered kernel
estimator $\hat{\alpha}(x)$ has an interesting extension to multivariate densities.
Assume that the kernel is a normal density and that the mixture (6) is a mixture of
normals. For each local bandwidth $h_j$, we associate both the posterior probabilities
from the mixture (the filtering function) and the covariance $\Sigma_j$ of the jth
component. Then for each j, K is replaced with $K_j$, which is a normal with
covariance $\Sigma_j$. Thus we can take into account local structure as represented
by the mixture approximation to the density.
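One way to read this multivariate extension, sketched here for two dimensions only, is to replace each scaled kernel by a bivariate normal with covariance $h_j^2 \Sigma_j$ centered at the data point. The paper does not spell out the formula, so the $h_j^{-2}$ scaling and all names below are assumptions.

```python
import math

def normal2d(u, cov):
    """Density at u of a zero-mean bivariate normal with 2x2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # quadratic form u^T cov^{-1} u, using the closed-form 2x2 inverse
    q = (d * u[0] * u[0] - (b + c) * u[0] * u[1] + a * u[1] * u[1]) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def fke2d(x, data, filters, bandwidths, covs):
    """Bivariate filtered kernel estimate with one normal kernel K_j per filter."""
    n = len(data)
    total = 0.0
    for xi in data:
        for rho, h, cov in zip(filters, bandwidths, covs):
            u = ((x[0] - xi[0]) / h, (x[1] - xi[1]) / h)
            # h^{-2} is the two-dimensional analogue of the 1/h_j factor in (5)
            total += rho(xi) * normal2d(u, cov) / (h * h)
    return total / n

identity = ((1.0, 0.0), (0.0, 1.0))
value = fke2d((0.0, 0.0), [(0.0, 0.0)], [lambda p: 1.0], [1.0], [identity])
```

With a single filter, identity covariance and one data point at the origin, the estimate at the origin is the bivariate standard normal peak $1/(2\pi)$.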

As always, we do not get something for nothing, and the filtered kernel estimator
is no exception. Although the asymptotics make almost no restrictions on the
choice of the filters and bandwidths, for finite samples these can be critical. In
order for the filtered kernel estimator to provide any improvement over a single
bandwidth kernel estimator (or anything else) we require filtering functions and
local bandwidths which are appropriate for the density to be estimated. As will be
shown in the examples below, this method works very well for densities that are
approximated reasonably well by a mixture model, provided one has a good
method for estimating the mixture model.

2.  ASYMPTOTICS


Assume the conditions on the $h_j$'s in equation (4). Assume further that K(t) is a
bounded density with zero mean and finite second moment $k_2$, that is,

$$\int t K(t)\,dt = 0, \qquad \int t^2 K(t)\,dt = k_2 < \infty. \tag{11}$$

Thm 1: Under the above conditions, $\hat{\alpha}(x)$ and $\tilde{\alpha}(x)$ are weakly consistent.


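Note that the theorem follows immediately from the consistency of the standard
kernel estimator for the estimator $\tilde{\alpha}$. To show weak consistency for
$\hat{\alpha}$ we need to show that both the bias and variance go to zero as n goes
to infinity.

$$\begin{aligned}
\mathrm{bias}(\hat{\alpha}) &= E\hat{\alpha} - \alpha \\
&= E\left[ \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\rho_j(x_i)}{h_j} K\left(\frac{x - x_i}{h_j}\right) \right] - \alpha(x) \\
&= \sum_{j=1}^{m} \int \frac{\rho_j(y)}{h_j} K\left(\frac{x - y}{h_j}\right) \alpha(y)\,dy - \alpha(x) \\
&\to \sum_{j=1}^{m} \rho_j(x)\,\alpha(x) - \alpha(x) = 0
\end{aligned} \tag{12}$$

by Bochner’s Lemma (Tapia and Thompson, 1978), since $h_j \to 0$. Similarly,

$$\begin{aligned}
\mathrm{Var}(\hat{\alpha}(x)) &= \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\rho_j(x_i)}{h_j} K\left(\frac{x - x_i}{h_j}\right) \right) \\
&\le \frac{1}{n} \sum_{j=1}^{m} \sum_{k=1}^{m} \int \frac{\rho_j(y)\rho_k(y)}{h_j h_k} K\left(\frac{x - y}{h_j}\right) K\left(\frac{x - y}{h_k}\right) \alpha(y)\,dy.
\end{aligned} \tag{13}$$

With a little manipulation we have

$$\mathrm{Var}(\hat{\alpha}(x)) \le \sup(K(t)) \sum_{j=1}^{m} \sum_{k=1}^{m} \frac{1}{n h_k} \int \frac{1}{h_j} K\left(\frac{x - y}{h_j}\right) \alpha(y)\,dy \to 0 \tag{14}$$

since $n h_j \to \infty$ for all j.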


Thm 2: Under the same conditions as theorem 1 and assuming the existence of
second derivatives of $\alpha$ and $\rho_j$, and that the second derivative of
$\alpha$ is in $L_2$, the filtered kernel estimator $\hat{\alpha}(x)$ is $L_2$ consistent.

pf: Recall that the mean integrated squared error (MISE) can be written as

$$\mathrm{MISE}(\hat{\alpha}) = \int \left[ \mathrm{bias}^2(\hat{\alpha}) + \mathrm{Var}(\hat{\alpha}) \right]. \tag{15}$$

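So, we have

$$\begin{aligned}
\mathrm{bias}(\hat{\alpha}) &= \sum_{j=1}^{m} \int \frac{1}{h_j} K\left(\frac{x - y}{h_j}\right) \rho_j(y)\,\alpha(y)\,dy - \alpha(x) \\
&= \sum_{j=1}^{m} \int K(t)\,\rho_j(x - h_j t)\,\alpha(x - h_j t)\,dt - \alpha(x) \\
&\approx \sum_{j=1}^{m} \int K(t) \left\{ \alpha(x)\rho_j(x) - t h_j \frac{d}{dx}\big(\alpha(x)\rho_j(x)\big) + \frac{t^2 h_j^2}{2} \frac{d^2}{dx^2}\big(\alpha(x)\rho_j(x)\big) \right\} dt - \alpha(x) \\
&= \frac{k_2}{2} \sum_{j=1}^{m} h_j^2 \frac{d^2}{dx^2}\big(\alpha(x)\rho_j(x)\big)
\end{aligned} \tag{16}$$

and so

$$\int \mathrm{bias}^2(\hat{\alpha}) \approx \frac{k_2^2}{4} \sum_{j=1}^{m} \sum_{k=1}^{m} h_j^2 h_k^2 \int \frac{d^2}{dx^2}\big(\alpha(x)\rho_j(x)\big)\, \frac{d^2}{dx^2}\big(\alpha(x)\rho_k(x)\big)\,dx. \tag{17}$$

Choosing $h_j = h_k = n^{-1/2}$ yields an integrated squared bias of order $O(n^{-2})$.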
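The reduction of the Taylor expansion to the leading bias term relies only on the kernel moment conditions (11) and on the filters summing to one (3). Spelled out term by term, as a sketch of the intermediate steps rather than a quotation from the paper:

```latex
\begin{align*}
\sum_{j=1}^{m} \alpha(x)\rho_j(x) \int K(t)\,dt
  &= \alpha(x) \sum_{j=1}^{m} \rho_j(x) = \alpha(x)
  && \text{(cancels the $-\alpha(x)$ term)} \\
\sum_{j=1}^{m} h_j \frac{d}{dx}\big(\alpha(x)\rho_j(x)\big) \int t\,K(t)\,dt
  &= 0
  && \text{(zero-mean kernel)} \\
\sum_{j=1}^{m} \frac{h_j^2}{2} \frac{d^2}{dx^2}\big(\alpha(x)\rho_j(x)\big) \int t^2 K(t)\,dt
  &= \frac{k_2}{2} \sum_{j=1}^{m} h_j^2 \frac{d^2}{dx^2}\big(\alpha(x)\rho_j(x)\big)
  && \text{(leading bias term)}
\end{align*}
```

As a sanity check, setting every $h_j = h$ and using $\sum_j \rho_j = 1$ gives $\frac{k_2}{2} h^2 \alpha''(x)$, the familiar leading bias of the single bandwidth estimator.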
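In the case of the variance we have

$$\mathrm{Var}(\hat{\alpha}) = \frac{1}{n} \mathrm{Var}\left( \sum_{j=1}^{m} \frac{\rho_j(y)}{h_j} K\left(\frac{x - y}{h_j}\right) \right) \tag{18}$$

(the variance being taken over y distributed as $\alpha$), and so

$$\mathrm{Var}(\hat{\alpha}) = \frac{1}{n} \int \left( \sum_{j=1}^{m} \frac{\rho_j(y)}{h_j} K\left(\frac{x - y}{h_j}\right) \right)^2 \alpha(y)\,dy + O(n^{-1}). \tag{19}$$

Letting

$$g(h_j, h_k) = \int K\left(\frac{t}{h_j}\right) K\left(\frac{t}{h_k}\right) dt \tag{20}$$

we have

$$\int \mathrm{Var}(\hat{\alpha}) \approx \frac{1}{n} \sum_{j=1}^{m} \sum_{k=1}^{m} \frac{g(h_j, h_k)}{h_j h_k} \int \rho_j(y)\,\rho_k(y)\,\alpha(y)\,dy. \tag{21}$$

Finally, assume without loss of generality that $h_j$ is less than or equal to $h_k$; then

$$g(h_j, h_k) \le \sup(K(t)) \int K\left(\frac{t}{h_j}\right) dt = h_j \sup(K(t))$$

and so

$$g(h_j, h_k) \le \min(h_j, h_k) \sup(K(t)), \tag{22}$$

so the integrated variance is of order $(n \min(h_k))^{-1}$, which can be made $O(n^{-1/2})$.
Combining equations (17) and (21) we have

$$\begin{aligned}
\mathrm{MISE} \approx{} & \frac{k_2^2}{4} \sum_{j=1}^{m} \sum_{k=1}^{m} h_j^2 h_k^2 \int \frac{d^2}{dx^2}\big(\alpha(x)\rho_j(x)\big)\, \frac{d^2}{dx^2}\big(\alpha(x)\rho_k(x)\big)\,dx \\
& + \frac{1}{n} \sum_{j=1}^{m} \sum_{k=1}^{m} \frac{g(h_j, h_k)}{h_j h_k} \int \rho_j(y)\,\rho_k(y)\,\alpha(y)\,dy.
\end{aligned} \tag{23}$$

Thus, MISE $\to$ 0, at a rate no worse than $O(n^{-1/2})$. As in the case of the standard
kernel estimator, the optimal rate is $O(n^{-4/5})$.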
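The two rates quoted above follow from balancing the bias and variance orders. A short worked version, under the simplifying assumption (made here for illustration) that every bandwidth shrinks at the same rate $h_j = c_j n^{-\gamma}$ with $0 < \gamma < 1$:

```latex
\begin{align*}
\int \mathrm{bias}^2 &= O(h^4) = O\!\left(n^{-4\gamma}\right), &
\int \mathrm{Var} &= O\!\left((nh)^{-1}\right) = O\!\left(n^{\gamma - 1}\right), \\
\gamma = \tfrac{1}{2}: \quad \mathrm{MISE} &= O\!\left(n^{-2}\right) + O\!\left(n^{-1/2}\right) = O\!\left(n^{-1/2}\right), \\
4\gamma = 1 - \gamma \;\Rightarrow\; \gamma = \tfrac{1}{5}: \quad \mathrm{MISE} &= O\!\left(n^{-4/5}\right).
\end{align*}
```

The choice $\gamma = 1/2$ is convenient for the proof but far from balanced; equating the two exponents gives the familiar $n^{-1/5}$ bandwidth rate.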




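Thm 3: With the same conditions as Theorem 2, $\tilde{\alpha}$ is $L_2$ consistent.

pf:

Following the calculations in theorem 2 and noting that the $\rho_j$ are bounded
above by 1, we have

$$\int \mathrm{bias}^2(\tilde{\alpha}) \approx \frac{k_2^2}{4} \int \sum_{j=1}^{m} \sum_{k=1}^{m} h_j^2 h_k^2\, \rho_j(x)\,\rho_k(x)\, \big(\alpha''(x)\big)^2\,dx \tag{24}$$

which is of order $O(n^{-2})$.

For the variance, we have

$$\int \mathrm{Var}(\tilde{\alpha}) \le \frac{1}{n} \sum_{j=1}^{m} \sum_{k=1}^{m} \frac{g(h_j, h_k)}{h_j h_k} \tag{25}$$

which is once again of order $(n \min(h_k))^{-1}$. Thus we have MISE $\to$ 0 at a rate
of $O(n^{-1/2})$, with optimal rate $O(n^{-4/5})$.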

For the rest of this paper we will consider the estimator (5), which will be
referred to as the FKE. Note that given any filter, for the optimal choice of the $h_j$
we have $\mathrm{MISE}_{FKE} \le \mathrm{MISE}_{SKE}$. This is a trivial consequence of
the fact that the FKE subsumes the SKE, when we take all the $h_j$'s to be equal.


3.  SPECIAL  CASE:  NORMAL  KERNELS 
We now assume that the kernel K is the standard normal. In this case we can
compute g() and obtain

$$g(h_j, h_k) = \frac{1}{\sqrt{2\pi}} \frac{h_j h_k}{\sqrt{h_j^2 + h_k^2}}. \tag{26}$$
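The closed form for g with a standard normal kernel can be checked against a direct numerical integration of $\int K(t/h_j) K(t/h_k)\,dt$. The sketch below is such a check; the function names, the integration range and the step count are choices made here for illustration.

```python
import math

def normal_kernel(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def g_closed_form(hj, hk):
    """g(h_j, h_k) = h_j h_k / sqrt(2 pi (h_j^2 + h_k^2)) for a normal kernel."""
    return hj * hk / math.sqrt(2.0 * math.pi * (hj * hj + hk * hk))

def g_numeric(hj, hk, half_width=12.0, steps=20000):
    """Trapezoid approximation to the integral of K(t/hj) K(t/hk) over t."""
    lim = half_width * min(hj, hk)   # the integrand is negligible beyond this
    dt = 2.0 * lim / steps
    total = 0.0
    for i in range(steps + 1):
        t = -lim + i * dt
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * normal_kernel(t / hj) * normal_kernel(t / hk)
    return total * dt

closed = g_closed_form(0.4, 2.2)
numeric = g_numeric(0.4, 2.2)
```

The two values agree closely, and both respect the general bound $g(h_j, h_k) \le \min(h_j, h_k) \sup K$ used in the consistency proof.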


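In keeping with the ideas discussed in the introduction, we assume in this section
that $\alpha$ is a mixture of normals, and that the filtering functions are generated by
the same mixture. Equation (23) then becomes, using the notation of equation (6)

$$\begin{aligned}
\mathrm{MISE} \approx{} & \frac{k_2^2}{4} \sum_{j=1}^{m} \sum_{k=1}^{m} \pi_j \pi_k h_j^2 h_k^2 \int f_j''(x)\, f_k''(x)\,dx \\
& + \frac{1}{n\sqrt{2\pi}} \sum_{j=1}^{m} \sum_{k=1}^{m} \frac{\pi_j \pi_k}{\sqrt{h_j^2 + h_k^2}} \int \frac{f_j(x)\, f_k(x)}{f(x)}\,dx.
\end{aligned} \tag{27}$$

At this point we introduce some notation.

$$A_{jk} = \pi_j \pi_k \int f_j''(x)\, f_k''(x)\,dx, \tag{28}$$

$$B_{jk} = \pi_j \pi_k \int \frac{f_j(x)\, f_k(x)}{f(x)}\,dx. \tag{29}$$

This gives the equation

$$\mathrm{MISE} \approx \frac{k_2^2}{4} \sum_{j=1}^{m} \sum_{k=1}^{m} h_j^2 h_k^2 A_{jk} + \frac{1}{n\sqrt{2\pi}} \sum_{j=1}^{m} \sum_{k=1}^{m} \frac{B_{jk}}{\sqrt{h_j^2 + h_k^2}}. \tag{30}$$

Taking the partial with respect to $h_r$ we have

$$\frac{\partial}{\partial h_r} \mathrm{MISE} \approx k_2^2 \left( A_{rr} h_r^3 + h_r \sum_{k \ne r} A_{kr} h_k^2 \right) - \frac{1}{n\sqrt{2\pi}} \left( \frac{B_{rr}}{\sqrt{2}\, h_r^2} + 2 h_r \sum_{k \ne r} \frac{B_{kr}}{(h_r^2 + h_k^2)^{3/2}} \right). \tag{31}$$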

Equations (30) and (31) are used in the next section to compute the optimal
bandwidths and MISE of a number of examples. Note that this must be done
numerically, since we do not have a closed form solution to the problem of
minimizing (30).
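A minimal sketch of such a numerical minimization follows. It assumes a standard normal kernel ($k_2 = 1$), takes $A_{jk} = \pi_j \pi_k \int f_j'' f_k''$ and $B_{jk} = \pi_j \pi_k \int f_j f_k / f$, approximates both by Riemann sums, and minimizes the asymptotic MISE one bandwidth at a time by golden-section search. The paper does not say which optimizer was used, so the search strategy, the grid settings and all function names are assumptions.

```python
import math

def npdf(x, mu, s):
    z = (x - mu) / s
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

def npdf2(x, mu, s):
    """Second derivative of the N(mu, s^2) density."""
    z = (x - mu) / s
    return npdf(x, mu, s) * (z * z - 1.0) / (s * s)

def coefficients(weights, means, sigmas, steps=4000):
    """Riemann-sum approximations to A_jk and B_jk for a normal mixture."""
    m = len(weights)
    lo = min(mu - 8.0 * s for mu, s in zip(means, sigmas))
    hi = max(mu + 8.0 * s for mu, s in zip(means, sigmas))
    dx = (hi - lo) / steps
    A = [[0.0] * m for _ in range(m)]
    B = [[0.0] * m for _ in range(m)]
    for i in range(steps + 1):
        x = lo + i * dx
        f = [w * npdf(x, mu, s) for w, mu, s in zip(weights, means, sigmas)]
        f2 = [w * npdf2(x, mu, s) for w, mu, s in zip(weights, means, sigmas)]
        total = sum(f)
        for j in range(m):
            for k in range(m):
                A[j][k] += f2[j] * f2[k] * dx
                if total > 0.0:
                    B[j][k] += f[j] * f[k] / total * dx
    return A, B

def amise(h, A, B, n):
    """Asymptotic MISE for a standard normal kernel (k_2 = 1)."""
    idx = range(len(h))
    bias = sum(h[j] ** 2 * h[k] ** 2 * A[j][k] for j in idx for k in idx) / 4.0
    var = sum(B[j][k] / math.sqrt(h[j] ** 2 + h[k] ** 2) for j in idx for k in idx)
    return bias + var / (n * math.sqrt(2.0 * math.pi))

def golden(func, lo, hi, iters=60):
    """Golden-section search for a minimum of a unimodal func on [lo, hi]."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    for _ in range(iters):
        if func(c) < func(d):
            b, d = d, c
            c = b - r * (b - a)
        else:
            a, c = c, d
            d = a + r * (b - a)
    return 0.5 * (a + b)

def optimal_bandwidths(A, B, n, sweeps=50, lo=1e-3, hi=50.0):
    """Coordinate-wise minimisation; a simple stand-in for a real optimiser."""
    h = [1.0] * len(A)
    for _ in range(sweeps):
        for r in range(len(h)):
            def along_r(v, r=r):
                return amise(h[:r] + [v] + h[r + 1:], A, B, n)
            h[r] = golden(along_r, lo, hi)
    return h

# Illustrative mixture: equal means, standard deviations 1 and 3
A, B = coefficients([0.5, 0.5], [0.0, 0.0], [1.0, 3.0])
h_opt = optimal_bandwidths(A, B, n=1000)
```

For a single standard normal component the objective reduces to the usual single bandwidth criterion, whose minimizer is $(4/(3n))^{1/5}$; this gives a simple check on the numerical pieces.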

In practice the true underlying mixture is not known, and in fact the data may
not come from a mixture at all. In this case it may not be clear how to apply the
above formulation and calculate the desired local bandwidths, not least because
the $A_{jk}$ and $B_{jk}$ require $\alpha$ to be a known mixture. We propose the following
approach to this problem: we first approximate the unknown density as a mixture,
then minimize (30) to calculate the bandwidths under the assumption that the
filtering density is the true density. Thus we use the optimal values for $h_j$ under the
assumption that the filtering mixture is correct. This is analogous to using a
reference density such as a normal to compute the bandwidth for the standard kernel
estimator. As in the case of the standard kernel estimator, our estimate will only be
optimal if the filtering mixture is indeed correct, but it will be a useful estimate as
long as the data is close to the filtering mixture.


4.  EXAMPLES

We compare the MISE of the FKE with the standard kernel estimator with h
chosen optimally. When simulations are performed, the bandwidths are chosen by
numerically minimizing (30). Following Wand, Marron, and Ruppert 1991, we
compute the efficiency of the estimator as $\mathrm{MISE}_{FKE}/\mathrm{MISE}_{SKE}$, so
small values of the efficiency correspond to better estimates with the FKE.

Case 1: Two means.

Let

$$\alpha(x) = \tfrac{1}{2} N(0, 1) + \tfrac{1}{2} N(m, 1). \tag{32}$$

It is easy to see that in this case the optimal bandwidth choice for the FKE requires
$h_1 = h_2 = h^{o}_{SKE}$, and so $\mathrm{MISE}_{FKE} = \mathrm{MISE}_{SKE}$ and the
efficiency is 1. This is intuitively what should happen, since the FKE is designed to
incorporate differences in variance of the underlying mixture components, and so it
will give no improvement in cases where the components differ only in mean.

Case 2: Two variances.

Let

$$\alpha(x) = \tfrac{1}{2} N(0, 1) + \tfrac{1}{2} N(0, v), \tag{33}$$

with $.1 \le v \le 10$. Figure 2a shows the efficiency as a function of the variance.
Note that for $v \ne 1$, the FKE improves on the SKE, as one would expect. Figure 2b
shows the two bandwidths used in the FKE. The bandwidth associated with the
second mixture term, the term for which we vary v, dramatically changes according
to v.

This is essentially the case that the FKE was designed to address. We have a
density which is a mixture of two normals with unequal variances. As the variance
of the second term is moved away from the variance of the first term, the standard
kernel’s single bandwidth becomes less and less appropriate for the resulting
density. The filtered kernel estimator allows us to take the two variances into account
in our estimator, thus improving the estimate when the variances are significantly
different.

Case 3: Outlier model

Let

$$\alpha(x) = p N(0, 1) + (1 - p) N(0, 100), \tag{34}$$

with $.01 \le p \le .99$. Figure 3 shows the efficiency for this model as a function of p.
Again we see that the FKE improves over the standard kernel provided 0 < p < 1.
Clearly the estimators are equal when p=0 or p=1.

Case 4: Marron and Wand Densities

Marron and Wand 1992 list 15 normal mixture densities showing some of the
wide range of variations that are obtainable with simple mixtures. Table 1 shows
the efficiency of the FKE for these densities. Note that the performance of the FKE
depends on the amount of local variability of the mixture, as would be expected.

The above examples dealt with the theoretical properties of the FKE, where the
filter is assumed to be equal to the underlying density. In practice this is not
possible, and in fact if the underlying density is known any attempts at estimation are
obviously unnecessary. In the next two examples we consider the case where the
underlying density is not known, and in fact in at least one (case 5) it is known not
to be a mixture of normals at all. In these cases we first fit a mixture to the data to
obtain a reasonable filter. Then we compute the $h_j$ under the assumption that the
filter is equal to the density. In practice, as will be seen below, this provides a good
estimator provided the filtering mixture captures most of the underlying variability
of the data.

Case  5:  Lognormal. 

100  data  points  were  drawn  from  a  lognormal  and  a  two  component  mixture 
was  fit  to  the  data  using  the  EM  method  (see,  e.g.,  Titterington,  Smith,  and  Makov 
1985).  The  bandwidths  for  the  filtered  kernel  estimator  were  chosen  assuming  the 
filter  to  be  equal  to  the  true  density.  Thus  we  first  construct  the  mixture  estimate 
and  then  use  the  bandwidths  that  would  be  optimal  for  that  mixture  density,  in 
much  the  same  way  that  one  might  use  a  normal  density  as  a  reference  estimate  for 
the  standard  kernel  estimator. 
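The mixture-fitting step can be sketched with a bare-bones EM iteration for two normal components. This is a generic textbook EM written for illustration; the initialization, the iteration count and the function names are assumptions, and it is not claimed to match the fitting procedure actually used for the figures.

```python
import math

def npdf(x, mu, s):
    z = (x - mu) / s
    return math.exp(-0.5 * z * z) / (s * math.sqrt(2.0 * math.pi))

def em_two_normals(data, iters=200):
    """Fit a two component normal mixture by EM; returns (weights, means, sigmas)."""
    n = len(data)
    mean = sum(data) / n
    spread = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    # crude start: components at the extremes, both with the overall spread
    w, mu, s = [0.5, 0.5], [min(data), max(data)], [spread, spread]
    for _ in range(iters):
        # E-step: posterior probability that each point belongs to component 0
        resp = []
        for x in data:
            p0 = w[0] * npdf(x, mu[0], s[0])
            p1 = w[1] * npdf(x, mu[1], s[1])
            total = p0 + p1
            resp.append(p0 / total if total > 0.0 else 0.5)
        # M-step: weighted proportions, means and standard deviations
        for j, r in enumerate((resp, [1.0 - v for v in resp])):
            nj = sum(r)
            w[j] = nj / n
            mu[j] = sum(rv * x for rv, x in zip(r, data)) / nj
            var = sum(rv * (x - mu[j]) ** 2 for rv, x in zip(r, data)) / nj
            s[j] = math.sqrt(max(var, 1e-12))   # guard against collapse
    return w, mu, s
```

The posterior probabilities computed in the E-step are exactly the filtering functions evaluated at the data points, so a fitted mixture supplies both the filter and, through the reference-mixture minimization, the bandwidths. For the lognormal example one would call `em_two_normals` on the 100 sampled values.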

Figure 4a shows the density estimates for the standard kernel estimator and the
FKE. The bandwidths for the FKE were $h_1 = .4$ and $h_2 = 2.2$. The bandwidth for
the standard kernel estimator was chosen by hand to get a reasonable fit to the true
density. The plot shown uses a value of .2 for h, which is about five times the
“optimal” value for a lognormal using 100 data points. Figure 4b compares the
density estimates of the FKE and the filtering mixture.

Case 6: Suicide Data.

Silverman 1986 uses a data set of lengths of treatment spells in days of control
patients in a suicide study to illustrate the kernel estimator. We fit two normals to
the data using the EM algorithm and use this as the filter for the FKE, as was done
with the lognormal example. In Figure 5a we compare the estimator with the two
kernel estimators Silverman uses, with bandwidths 20 and 60. The FKE uses the
bandwidths $h_1 = 19.17$ at the mode and $h_2 = 127.17$ in the tail. It is noteworthy that
these bandwidths correspond to Silverman’s choice of h=20, made to get pleasing
results in the mode at the expense of tail smoothness, and to double Silverman’s
h=60, chosen to yield a smooth tail. The FKE is compared with the mixture
approximation in Figure 5b. Note that the FKE allows a good fit to the mode while
maintaining the smoothness of the tail. The mode smoothness can be varied by varying
the appropriate bandwidth ($h_1$) without having much effect on the fit to the tail, as
can be seen by the plot of the filtering functions in Figure 5c. This figure makes clear
the local character of the bandwidths in the FKE. This is also illustrated in Figure 5d,
where the bandwidth associated with the mode is reduced, making the mode more
pronounced and rough without affecting the tail smoothness. It is this ability that
makes the FKE a very interesting, and we feel useful, estimator.

5.  CONCLUSIONS

The filtered kernel estimator is superior in performance to the standard kernel
estimator, provided appropriate filter functions and bandwidths can be chosen. In
section 2 it was shown that any filter functions will give asymptotic performance
no worse than the standard single bandwidth kernel estimator.

It would seem at first that the added trouble of selecting filtering functions and
bandwidths would make the estimator difficult to use in practice. However, the idea
of using a finite mixture fit to the data to construct the filters is one which appears
to work well in a variety of situations, even those for which the data is not drawn
from a finite mixture. The ability to take local structure into account is a powerful
one which will allow much better estimates in those situations where there is
reason to believe the local structure is justified.

It should be noted that bad filters produce bad FKE’s. This is not unreasonable;
however, it does mean that care must be used in the choice of the filtering mixture.
Just as the standard kernel estimator produces errors when the bandwidth is taken
to be too large or too small, mixtures which have terms which are not supported by
the data will produce local errors in the FKE estimate. This local character of the
estimator gives some protection, since the effect of the error is reduced outside the
region in which the corresponding $\rho$ dominates. This is in contrast to the single
kernel estimator where the choice of the bandwidth has a global effect.

We have focused on the univariate case in this work, but other extensions are
possible. The multivariate version of the FKE has much promise, and will be
addressed in the future. The ability to effectively tune the kernels to the local
structure of the data will be a powerful and useful tool for multivariate density
estimation. It is believed that this ability to define the structure locally will be of
use in exploratory data analysis and in discriminant analysis.

Table 1:

Efficiency results for the FKE for the 15 normal mixture densities from Marron
and Wand 1992. These densities show some of the wide range of variations that are
obtainable with simple mixtures. The performance of the FKE is superior to that of
the standard kernel estimator when there is variability of the mixture’s local
variance structure.

Density                     Efficiency
Gaussian                    1
Skewed Unimodal             .60
Strongly Skewed             .38
Kurtotic Unimodal           .44
Outlier                     .91
Bimodal                     1
Separated Bimodal           1
Skewed Bimodal              .69
Trimodal                    .90
Claw                        .63
Double Claw                 .13
Asymmetric Claw             .30
Asymmetric Double Claw      .15
Smooth Comb                 .41
Discrete Comb               .5

Figure 1

An example of the filtered kernel estimator applied to a two component mixture.
The mixture probability density function, weighted kernels, and posterior
$\rho$ functions are shown.

Figure 2a

Efficiency as a function of variance for case 2:
$\alpha(x) = \tfrac{1}{2} N(0, 1) + \tfrac{1}{2} N(0, v)$. As v deviates from 1 the standard
kernel’s single bandwidth becomes less and less appropriate for the resulting density.
The filtered kernel estimator allows us to take the two variances into account
in our estimator, thus improving the estimate when the variances are significantly
different.

Figure 2b

The two bandwidths used in the FKE. The bandwidth associated with the
second mixture term (solid line), the term for which we vary v, dramatically
changes with v, allowing the FKE to model the local variance structure of the
underlying density.

Figure 3

Efficiency as a function of p for the outlier model (case 3):
$\alpha(x) = p N(0, 1) + (1 - p) N(0, 100)$. As p runs from .01 to .99 the FKE
improves over the standard kernel estimator because the underlying density
has nonconstant local variance structure.

Figure 4a

Density estimates for the standard kernel estimator and the FKE, along with
the true lognormal density from which 100 sample observations were drawn
(case 5). The bandwidths for the FKE are $h_1 = .4$ and $h_2 = 2.2$. The bandwidth
for the standard kernel estimator (h=.2) was chosen to get a reasonable fit to
the true density.

Figure 4b

The FKE, the filtering mixture, and the true lognormal density from case 5.

Figure 5a

Case 6: suicide data from Silverman 1986. A mixture of two normals is used as
the filter for the FKE. The FKE is compared with standard kernel estimators
using bandwidths of 20 and 60. The FKE uses the bandwidths $h_1 = 19.17$ at the
mode and $h_2 = 127.17$ in the tail and combines the features of the two SKEs:
detail in the mode and smoothness in the tail.

Figure 5b

Comparison of the FKE from Figure 5a and its associated mixture approximation.

Figure 5c

The filtering functions $\rho$ defined by the mixture estimate shown in Figure 5b.
These posterior functions dictate the local character of the bandwidths in the
FKE, and indicate that the local smoothness can be varied by varying the
appropriate bandwidth without having much effect on the fit in other regions.

Figure 5d

An example of local selective tuning. For the FKE shown in Figure 5a (dashed
line) the bandwidth associated with the mode has been reduced, making the
mode more pronounced and rough without affecting the tail smoothness. This
is due to the effect of the filtering functions shown in Figure 5c.